BREAKING THE CODE: Predicting Where Disease Will Strike
Abstract
Background: We applied stochastic search variable selection (SSVS), a Bayesian model selection method, to the simulated data of Genetic Analysis Workshop 13. We used SSVS with the revisited Haseman-Elston method to find the markers linked to the loci determining change in cholesterol over time. To study gene-gene interaction (epistasis) and gene-environment interaction, we adopted prior structures that incorporate the relationships among the predictors. This allows SSVS to search the model space more efficiently and avoid the less likely models.

Results: In applying SSVS, instead of looking at the posterior distribution of each candidate model, which is sensitive to the prior settings, we ranked the candidate variables (markers) by their marginal posterior probability, which was shown to be more robust to the prior. Compared with traditional methods that consider one marker at a time, our method considers all markers simultaneously and obtains more favorable results.

Conclusions: We showed that SSVS is a powerful method for identifying linked markers using the Haseman-Elston method, even for weak effects. SSVS is effective because it searches the entire model space while concentrating on the more probable models.

Background
In this work, we analyzed the slope of the cholesterol increase with age in the simulated data (Problem 2). Our objective was to identify the markers that are linked to the disease genes related to a high rate of increase in cholesterol. Genetic Analysis Workshop 13 provided the information that the disease genes are located on chromosomes 7 (s7), 15 (s8), and 21 (s9), and that the gene on chromosome 21 (s9) affects cholesterol rate only in females, i.e., it interacts with gender. The Haseman-Elston method [1] allows linear regression methods to be applied to linkage analysis.
For each sibling pair, it uses the number of alleles shared identical by descent (IBD) at each marker as the explanatory variables and a statistic measuring the similarity of the quantitative trait values in the sibling pair as the response variable. The original Haseman-Elston method [1] used the squared difference between the traits of the siblings. In a recent publication, Elston et al. [1] proposed the cross-product of the two trait values in a sib pair as the response, which was used in this paper. Suh et al. [2] applied stochastic search variable selection (SSVS), a Bayesian variable selection method proposed by George and McCulloch [3] for the linear regression model, to the Haseman-Elston method. Although the scope of Suh et al. [2] was very preliminary, with only the IBD values at the linked markers plus 10 unlinked markers used as candidate explanatory variables in the variable selection, it showed the Bayesian variable selection approach to be very promising.

From Genetic Analysis Workshop 13: Analysis of Longitudinal Family Data for Complex Diseases and Related Risk Factors, New Orleans Marriott Hotel, New Orleans, LA, USA, November 11–14, 2002. Published: 31 December 2003. BMC Genetics 2003, 4(Suppl 1):S69. Genetic Analysis Workshop 13: Analysis of Longitudinal Family Data for Complex Diseases and Related Risk Factors. Laura Almasy, Christopher I Amos, Joan E Bailey-Wilson, Rita M Cantor, Cashell E Jaquish, Maria Martinez, Rosalind J Neuman, Jane M Olson, Lyle J Palmer, Stephen S Rich, M Anne Spence, Jean W MacCluer. This article is available from: http://www.biomedcentral.com/1471-2156/4/s1/S69

The study presented here extended these methods in two respects. First, we took advantage of SSVS by including all 399 markers as candidate explanatory variables.
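To give a sense of scale for a search over all 399 markers, the following Python sketch counts the candidate main-effect models, treating each marker as simply in or out of the model. The throughput figure `FITS_PER_SECOND` is a hypothetical value chosen only for illustration and does not come from the study.

```python
# Each of the 399 markers is either included or excluded, so every
# subset of markers is one candidate main-effect model.
N_MARKERS = 399  # number of markers stated in the text
n_models = 2 ** N_MARKERS

# Hypothetical throughput, assumed purely for illustration.
FITS_PER_SECOND = 10 ** 9
SECONDS_PER_YEAR = 365 * 24 * 3600
years_needed = n_models // (FITS_PER_SECOND * SECONDS_PER_YEAR)

# Report orders of magnitude rather than the full integers.
print(f"candidate models: about 10^{len(str(n_models)) - 1}")
print(f"years for exhaustive fitting: about 10^{len(str(years_needed)) - 1}")
```

Even under this generous assumed speed, exhaustive enumeration is out of reach, which is why a stochastic search over the model space is attractive.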
It is computationally impossible to consider all subsets of 399 markers using a traditional frequentist approach. Second, a hierarchical prior probability structure, as discussed by Chipman [4], was imposed on the model space to study the interaction effects (epistasis). The results were reported and compared with those obtained with the more traditional forward and backward step-wise regression.

Methods
Haseman-Elston method
We chose to analyze the rate of change in cholesterol over time in the simulated data. First, for each individual, we obtained the least-squares (LS) estimate of the slope of cholesterol over time. For the $i$th sibling pair, using the LS estimates of slope as the traits $(Y_{1i}, Y_{2i})$, we computed their cross-product $CP_i = (Y_{1i} - m)(Y_{2i} - m)$ as our response value, where $m$ is the mean of the slopes over all siblings in the same family. Elston et al. [1] introduced the cross-product CP as the replacement for the squared difference

$$D_i^2 = (Y_{1i} - Y_{2i})^2.$$

In our regression analysis, we adopted CP as the response and also used the squared difference for comparison. For simplicity, we assumed the errors to be independent, but a correlation structure could be implemented in our method in a straightforward way. There are about 1500 full sib pairs and a few half sib pairs in each replicate; the replicate we considered has 1522 full sib pairs. The number of alleles shared IBD was obtained for each sib pair at each marker using the SIBPAL program of the SAGE software [5]. There were a total of 399 markers. We had

$$\text{Response} = \beta_0 + \sum_{j=1}^{p} \beta_j X_j + \varepsilon,$$

where the errors $\varepsilon \sim N(0, \sigma^2)$ were assumed to be independent and the $X_j$ values were IBD scores. To study the effect of gender, we also included the genders of the siblings as an explanatory variable, coded as two dummy explanatory variables as follows: (male, male) = (0, 0), (male, female) = (0, 1), and (female, female) = (1, 1).

SSVS
George and McCulloch [3] proposed a Bayesian model selection method for variable selection based on the Gibbs sampler.
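As a rough illustration of the Haseman-Elston quantities defined in the Methods above, the Python sketch below computes a per-individual LS slope, the cross-product and squared-difference responses for sib pairs, and the two gender dummy variables. The function names and the "M"/"F" coding are assumptions made for this example; the study obtained its IBD scores from SIBPAL, which is not reproduced here.

```python
import numpy as np

def ls_slope(ages, chol):
    """Least-squares slope of cholesterol on age for one individual."""
    slope, _intercept = np.polyfit(ages, chol, deg=1)
    return float(slope)

def he_responses(y1, y2, m):
    """Cross-product and squared-difference responses for sib pairs.

    y1, y2: slopes of the first and second sibling in each pair.
    m: mean slope of the siblings in the pair's family (scalar or per pair).
    """
    y1 = np.asarray(y1, dtype=float)
    y2 = np.asarray(y2, dtype=float)
    cross_product = (y1 - m) * (y2 - m)
    squared_difference = (y1 - y2) ** 2
    return cross_product, squared_difference

def gender_dummies(sex1, sex2):
    """Two dummies: (M, M) -> (0, 0), (M, F) -> (0, 1), (F, F) -> (1, 1)."""
    n_female = int(sex1 == "F") + int(sex2 == "F")
    return (int(n_female == 2), int(n_female >= 1))
```

A concordant pair (both slopes above or both below the family mean) yields a positive cross-product, which is the similarity signal regressed on the IBD scores.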
The criterion of interest was taken to be the posterior probability of a model conditional on the data, which could be obtained using stochastic search variable selection. For the simplest case of linear regression with normal errors,

$$Y = X'\beta + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I),$$

where $\beta$ may contain main effects or interaction effects, they set the prior distribution of each $\beta_k$ as a mixture of two normal distributions by introducing the latent variable $\gamma_k$:

$$\beta_k \mid \gamma_k \sim (1 - \gamma_k)\, N(0, \tau^2) + \gamma_k\, N(0, c^2 \tau^2),$$

where the much larger variance ($c > 1$) allows a coefficient with $\gamma_k = 1$ to have a large influence. A recommended choice for these parameter values is given by George and McCulloch [3]. The value of $c$ was set equal to 10 in our analysis. A model was represented by a vector $\gamma = (\gamma_1, \gamma_2, \ldots, \gamma_p)$, where $\gamma_k = 0$ or 1. If $\gamma_k = 0$, the marker $X_k$ was considered to be excluded from the model, and if $\gamma_k = 1$, it was considered to be included. Note that $\beta_0$ was always included, so we could set $\beta_0 \sim N(0, c^2 \tau^2)$. With appropriate priors on $\gamma$ and $\sigma^2$, we obtained the posterior distribution of $\gamma$ using Gibbs sampling. By examining the posterior probability of $\gamma$, we identified the optimal model as the one with the largest posterior probability and ranked the markers using the marginal distribution of each $\gamma_k$.

A prior for $\gamma$ corresponds to a prior on the model. The commonly used independence prior implies that the importance of any variable is independent of that of any other variable. In other words, under this prior, each $X_i$ enters the model independently of the other variables, with probability $p(\gamma_i = 1) = 1 - p(\gamma_i = 0) = p_i$. A smaller $p_i$ can be used to downweight $X_i$ values that are costly or of less interest. For our case, a useful reduction was to set $p_i = p$, where $p$ is the a priori expected proportion of $X_i$ values in the model. When only main effects and no interactions were considered, the importance of any variable was independent of the importance of any other variable.
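The Gibbs sampling scheme outlined above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the value `tau=0.05`, the vague prior proportional to 1/σ² on the error variance, and the use of centring in place of an explicit always-included intercept are all choices made for this example. The defaults `p=0.02` and `c=10` follow the text.

```python
import numpy as np

def ssvs_gibbs(X, y, p=0.02, tau=0.05, c=10.0, n_iter=2000, burn_in=500, seed=0):
    """Return the marginal posterior inclusion frequency of each column of X."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, q = X.shape
    # Assumption: centring X and y stands in for the always-included intercept.
    X = X - X.mean(axis=0)
    y = y - y.mean()
    XtX, Xty = X.T @ X, X.T @ y
    gamma = np.zeros(q, dtype=int)
    sigma2 = float(np.var(y))
    counts = np.zeros(q)
    # Prior log-odds of inclusion plus the log ratio of the normalising constants.
    base_log_odds = np.log(p / (1.0 - p)) - np.log(c)
    for it in range(n_iter):
        # 1. beta | gamma, sigma2, y ~ N(Q^{-1} X'y / sigma2, Q^{-1})
        prior_var = np.where(gamma == 1, (c * tau) ** 2, tau ** 2)
        Q = XtX / sigma2 + np.diag(1.0 / prior_var)
        L = np.linalg.cholesky(Q)  # Q = L L'
        mean = np.linalg.solve(Q, Xty / sigma2)
        beta = mean + np.linalg.solve(L.T, rng.standard_normal(q))
        # 2. sigma2 | beta, y ~ inverse gamma (assumed vague prior 1 / sigma2)
        resid = y - X @ beta
        sigma2 = 1.0 / rng.gamma(shape=n / 2.0, scale=2.0 / float(resid @ resid))
        # 3. gamma_k | beta_k ~ Bernoulli, comparing the wide and narrow normals
        log_odds = base_log_odds + 0.5 * beta ** 2 * (
            1.0 / tau ** 2 - 1.0 / (c * tau) ** 2
        )
        prob_in = 1.0 / (1.0 + np.exp(-log_odds))
        gamma = (rng.random(q) < prob_in).astype(int)
        if it >= burn_in:
            counts += gamma
    return counts / (n_iter - burn_in)
```

The returned frequencies are the quantities used for ranking: a column that is sampled into the model in most retained iterations has a high marginal posterior probability of inclusion.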
Thus the independence prior implied that the prior of $\gamma$ was simply set as $\text{prob}(\gamma) = p^n$, where $n$ is the number of ones in $\gamma$ (strictly, the independence prior is $p^n (1 - p)^{q - n}$ for $q$ candidate variables; $p^n$ drops the second factor). Increased weight on parsimonious models can be obtained by setting $p$ small. In our case, $p$ was first set to 0.02; to see how robust our method is to this choice, we then chose $p = 0.002$ for comparison. The details of the MCMC algorithms can be found in George and McCulloch [3].

We applied SSVS to select markers linked to cholesterol rate from all 399 markers under consideration. Since it is impractical to track the complete posterior of $\gamma$, only the marginal posterior of each marker was obtained. Although both the posterior probabilities of the models and the marginal probabilities of each marker are sensitive to the prior settings, especially $c$ and $p$, we showed that the ranking of the markers by marginal posterior is not. Figure 2 illustrates this robustness by plotting the rankings of the markers obtained under two different priors, $p = 0.02$ and $p = 0.002$. Other prior settings showed similarly high correlations in the rankings of the markers.

Figure 1. Top ranked markers based upon SSVS. The markers are ranked according to their marginal posterior obtained using 10,000 cycles of the Gibbs algorithm. Disease loci are located on chromosomes 7, 15, and 21, and the gender effect is ranked 15th (Replicate 1 of the simulated data).

We followed the Markov chain Monte Carlo (MCMC) algorithms described in Chipman [4] and implemented them in the Java programming language. The programs were run on a Linux cluster using Intel processors.
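The marginal-posterior ranking and the robustness check described above can be sketched in Python as follows. The helper names are hypothetical, and the rank correlation shown is one plausible way to quantify agreement between two prior settings; the text itself reports the comparison only as a plot.

```python
import numpy as np

def marginal_inclusion(gamma_samples, burn_in):
    """Relative frequency of each variable in the retained samples of gamma."""
    kept = np.asarray(gamma_samples, dtype=float)[burn_in:]
    return kept.mean(axis=0)

def ranks(scores):
    """Rank 1 = largest score; ties are broken by column order."""
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    result = np.empty(len(order), dtype=int)
    result[order] = np.arange(1, len(order) + 1)
    return result

def rank_correlation(scores_a, scores_b):
    """Correlation between the rankings produced by two prior settings."""
    return float(np.corrcoef(ranks(scores_a), ranks(scores_b))[0, 1])
```

Here `gamma_samples` would be an iterations-by-variables array of 0/1 draws, and `scores_a` and `scores_b` the marginal posteriors obtained under two values of p. The absolute probabilities may differ a great deal between the two runs while the rankings stay nearly identical.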
The length of the MCMC chain was set to 10,000. The running time was approximately 30 minutes on a single 1.0 GHz CPU in the environment specified above. The first 1000 samples were used as the burn-in period and were not included in estimating the posterior.

Hierarchical prior structure
When interaction effects (epistasis) are considered in the model selection, the model space becomes enormous and the common independence prior for $\gamma$ is no longer appropriate. With interactions, the prior for $\gamma$ can capture the dependence between the importance of a higher-order term and that of the lower-order terms from which it was formed. Chipman [4] proposed a hierarchical prior structure for this model space. The importance of an interaction such as $X_i X_j$ depends only on whether the main effects $X_i$ and $X_j$ are included in the model. This belief can be expressed by a prior for $\gamma = (\gamma_{X_i}, \gamma_{X_j}, \gamma_{X_i X_j})$ of the form

$$p(\gamma) = p(\gamma_{X_i})\, p(\gamma_{X_j})\, p(\gamma_{X_i X_j} \mid \gamma_{X_i}, \gamma_{X_j}).$$

The probability that the term $X_i X_j$ is active, $\Pr(\gamma_{X_i X_j} = 1 \mid \gamma_{X_i}, \gamma_{X_j})$, may take on four different values, depending on the values of the pair $(\gamma_{X_i}, \gamma_{X_j})$. In our analysis, we set $(p_{00}, p_{01}, p_{10}, p_{11}) = (0, 0, 0, p)$. This corresponds to the prior belief that if the interaction effect between two factors exists in a model, the main effects of both factors must be included in the same model.

Figure 2. Rankings of markers with p = 0.02 and p = 0.002. The rankings of the markers under two different prior settings (p = 0.02 and p = 0.002) are plotted, showing robustness to the choice of p.

Our study was conducted in two stages. At the first stage, all 399 candidate markers and gender were the candidate variables in SSVS, but interactions were not considered.
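The hierarchical prior described above can be made concrete with a small Python sketch for a single pair of main effects and their interaction. The function names are hypothetical, and using the same value 0.02 for the main-effect and interaction probabilities is an assumption made only for this illustration.

```python
from itertools import product

def interaction_prob(g_i, g_j, p):
    """Pr(gamma_ij = 1 | gamma_i, gamma_j) with (p00, p01, p10, p11) = (0, 0, 0, p)."""
    table = {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): p}
    return table[(g_i, g_j)]

def triplet_prior(g_i, g_j, g_ij, p_main, p_int):
    """p(gamma_i) * p(gamma_j) * p(gamma_ij | gamma_i, gamma_j)."""
    def bernoulli(g, prob):
        return prob if g == 1 else 1.0 - prob
    return (
        bernoulli(g_i, p_main)
        * bernoulli(g_j, p_main)
        * bernoulli(g_ij, interaction_prob(g_i, g_j, p_int))
    )

# The eight configurations of (gamma_i, gamma_j, gamma_ij) form a proper prior.
total = sum(
    triplet_prior(a, b, ab, 0.02, 0.02) for a, b, ab in product((0, 1), repeat=3)
)
```

Any configuration with the interaction switched on but a parent main effect switched off receives prior probability zero, so the sampler never spends time on such models.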
At the second stage, SSVS was applied to the same sib-pair responses, with the top 30 candidate variables selected from the first stage and their interactions as the candidate variables. Among the 30 were gender and 29 markers. Together with the 435 pairwise interactions among them, this brought the total number of candidate variables in SSVS to 465. We chose only the top 30 variables from the first stage for two reasons. First, it is reasonable to assume that only a few linked loci exist and that they should be contained in the top 30. Second, this is the maximum size that the current SSVS algorithm handles comfortably in the second stage.

Step-wise regression
To compare the traditional method with ours, we used a step-wise method based on the Akaike information criterion (AIC) [6] to select a formula-based model, as implemented in R, the "GNU S". The details of this method can be found in the R manual [7].

Results
Only the first of the 100 simulated data sets was used. Figure 1 displays the marginal posterior of each marker obtained from SSVS with all 399 markers but no interactions. The marginal posterior was computed from the relative frequency of each marker in the MCMC sample of $\gamma$. It clearly showed that the high posterior values are concentrated on chromosomes 7, 15, and 21. Table 1 shows the top 30 markers: a marker from chromosome 7 is rated as most significant, and there are seven, four, and two markers from chromosomes 7, 15, and 21, respectively. The variable gender was ranked 15th. Table 2 shows the 20 most significant markers obtained from the univariate LS regression and from the step-wise regression. These markers were distributed fairly evenly across all chromosomes. In the univariate regression results, a marker from chromosome 13 was most significant. One marker each from chromosome 7 and chromosome 15 was only marginally significant; none from chromosome 21 (where a linked marker was located) was among the 20 most significant markers.
Similar results were obtained when backward and forward step-wise regression methods were used. Among the top 20, only two markers were from chromosome 15, and one each from chromosomes 7 and 21. Also, these two traditional methods failed to identify the gender effect as significant.
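For readers unfamiliar with the step-wise comparison method, a forward AIC search can be sketched in Python as below. The study used the step-wise procedure in R; this version is an illustrative stand-in written for this example, it covers the forward direction only, and it assumes the Gaussian linear-model form AIC = n log(RSS/n) + 2k, which is defined up to an additive constant.

```python
import numpy as np

def aic(X, y, cols):
    """AIC (up to a constant) of an LS fit with an intercept and columns `cols`."""
    n = len(y)
    design = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ coef) ** 2))
    return n * np.log(rss / n) + 2 * design.shape[1]

def forward_stepwise(X, y):
    """Add one variable at a time while doing so lowers the AIC."""
    selected, remaining = [], list(range(X.shape[1]))
    best = aic(X, y, selected)
    while remaining:
        # Score every candidate addition and keep the one with the lowest AIC.
        cand_aic, cand = min((aic(X, y, selected + [j]), j) for j in remaining)
        if cand_aic >= best:
            break  # no single addition improves the criterion
        best = cand_aic
        selected.append(cand)
        remaining.remove(cand)
    return selected, best
```

Unlike SSVS, this greedy search commits to one variable at a time and follows a single path through the model space, which is one reason it can miss weak effects that only stand out when several markers are considered together.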